Assignment 2 CSCN8000 Artificial Intelligence Algorithms and Mathematics
Sudhan Shrestha - 8889436

  1. Use iris flower dataset from sklearn library and try to form clusters of flowers using petal width and length features. Drop the other two features for simplicity.

    Figure out if any preprocessing such as scaling would help here

    Draw elbow plot and from that figure out optimal value of k

In [97]:
# importing the libraries and modules used below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from scipy import stats
from sklearn.metrics import accuracy_score
In [98]:
# loading the iris dataset
iris = load_iris(as_frame=True, return_X_y=False)
iris
Out[98]:
{'data':      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
 0                  5.1               3.5                1.4               0.2
 1                  4.9               3.0                1.4               0.2
 2                  4.7               3.2                1.3               0.2
 3                  4.6               3.1                1.5               0.2
 4                  5.0               3.6                1.4               0.2
 ..                 ...               ...                ...               ...
 145                6.7               3.0                5.2               2.3
 146                6.3               2.5                5.0               1.9
 147                6.5               3.0                5.2               2.0
 148                6.2               3.4                5.4               2.3
 149                5.9               3.0                5.1               1.8
 
 [150 rows x 4 columns],
 'target': 0      0
 1      0
 2      0
 3      0
 4      0
       ..
 145    2
 146    2
 147    2
 148    2
 149    2
 Name: target, Length: 150, dtype: int32,
 'frame':      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 0                  5.1               3.5                1.4               0.2   
 1                  4.9               3.0                1.4               0.2   
 2                  4.7               3.2                1.3               0.2   
 3                  4.6               3.1                1.5               0.2   
 4                  5.0               3.6                1.4               0.2   
 ..                 ...               ...                ...               ...   
 145                6.7               3.0                5.2               2.3   
 146                6.3               2.5                5.0               1.9   
 147                6.5               3.0                5.2               2.0   
 148                6.2               3.4                5.4               2.3   
 149                5.9               3.0                5.1               1.8   
 
      target  
 0         0  
 1         0  
 2         0  
 3         0  
 4         0  
 ..      ...  
 145       2  
 146       2  
 147       2  
 148       2  
 149       2  
 
 [150 rows x 5 columns],
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n                \n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n.. topic:: References\n\n   - Fisher, R.A. 
"The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 'filename': 'iris.csv',
 'data_module': 'sklearn.datasets.data'}
In [99]:
# assigning the `iris.data` to the variable `X` and `iris.target` to the variable `y`.
X = iris.data
y = iris.target

print("Shape of X: ", X.shape)
print("Shape of y: ", y.shape)
Shape of X:  (150, 4)
Shape of y:  (150,)
In [100]:
X
Out[100]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns

In [101]:
y
Out[101]:
0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int32
In [102]:
# dropping the columns 'sepal length (cm)' and 'sepal width (cm)' from the DataFrame `X`.
# The `axis=1` parameter specifies that the columns should be dropped. 
X = X.drop(['sepal length (cm)','sepal width (cm)'],axis=1)
X
Out[102]:
petal length (cm) petal width (cm)
0 1.4 0.2
1 1.4 0.2
2 1.3 0.2
3 1.5 0.2
4 1.4 0.2
... ... ...
145 5.2 2.3
146 5.0 1.9
147 5.2 2.0
148 5.4 2.3
149 5.1 1.8

150 rows × 2 columns

In [103]:
# defining StandardScaler and scaling X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
In [104]:
from sklearn.metrics import silhouette_score
# performing K-means clustering and computing the silhouette score to evaluate the clustering quality.
# k = 5 is an arbitrary choice here, used only to compare scaled vs. un-scaled data.

# on un-scaled data
kmeans = KMeans(n_clusters=5, n_init=10, random_state=23)
cluster_labels = kmeans.fit_predict(X)
silhouette_no_scaling = silhouette_score(X, cluster_labels)

# on scaled data
kmeans_scaled = KMeans(n_clusters=5, n_init=10, random_state=23)
cluster_labels_scaled = kmeans_scaled.fit_predict(X_scaled)
silhouette_scaling = silhouette_score(X_scaled, cluster_labels_scaled)

print(f"Silhouette Score without scaling: {silhouette_no_scaling:.2f}")
print(f"Silhouette Score with scaling: {silhouette_scaling:.2f}")
Silhouette Score without scaling: 0.59
Silhouette Score with scaling: 0.57

At k = 5, the silhouette score without scaling (0.59) is slightly higher than with scaling (0.57), suggesting marginally better-separated clusters on the raw data. The difference is small, however, and this is expected: both petal features are measured in centimetres and span similar ranges, so scaling has little effect and is not strictly necessary here.
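The k = 5 used for this comparison was arbitrary. As a cross-check on the elbow plot below, silhouette scores can also be computed over a range of k values; a minimal sketch on the un-scaled petal features (variable names are illustrative, not from the notebook):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# petal features only, matching the assignment setup
X_petal = load_iris(as_frame=True).data[['petal length (cm)', 'petal width (cm)']]

scores = {}
for k in range(2, 7):  # silhouette is only defined for 2 or more clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=23).fit_predict(X_petal)
    scores[k] = silhouette_score(X_petal, labels)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()}, "-> best k by silhouette:", best_k)
```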

In [105]:
import warnings
warnings.filterwarnings('ignore')


#  performing the K-means clustering and generating an elbow plot to determine the optimal number of clusters for the dataset.
wcss = []
max_clusters = 10  

for k in range(1, max_clusters + 1):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=23)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, max_clusters + 1), wcss, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('The Elbow Plot')
plt.grid(True)
plt.show()

From the elbow plot, the WCSS drops steeply up to k = 3 and only gradually afterwards, so the optimal number of clusters is k = 3.
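As a sanity check on k = 3, the final model can be fit and its clusters compared against the known species labels (the labels play no role in the fitting itself). A self-contained sketch; the majority-vote mapping below is an illustrative heuristic, not a standard API:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X_petal = iris.data[['petal length (cm)', 'petal width (cm)']]

km = KMeans(n_clusters=3, n_init=10, random_state=23).fit(X_petal)

# map each cluster to its majority species and count how many points agree
agreement = 0
for c in range(3):
    members = iris.target[km.labels_ == c]
    agreement += int((members == members.mode()[0]).sum())
print(f"majority-label agreement: {agreement}/{len(X_petal)}")
```

On the petal features the clusters track the species closely, which supports k = 3 as the natural grouping.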

  2. Use the heart dataset from the Resources Folder or access it from https://www.kaggle.com/fedesoriano/heart-failure-prediction

    Load heart disease dataset in pandas dataframe

    Remove outliers using Z score. Usual guideline is to remove anything that has Z score > 3 formula or Z score < -3

    Convert text columns to numbers using label encoding / one hot encoding

    Apply scaling

    Build a classification model using various methods (SVM, logistic regression, random forest) and check which model gives you the best accuracy

    Now use PCA to reduce dimensions, retrain your model and see its impact on your model in terms of accuracy.

In [106]:
# reading a CSV file named 'heart.csv' and storing its contents in a pandas DataFrame called `df_heart`.
df_heart = pd.read_csv('csv/heart.csv')
df_heart.head()
Out[106]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
In [107]:
df_heart.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
In [108]:
df_heart.describe()
Out[108]:
Age RestingBP Cholesterol FastingBS MaxHR Oldpeak HeartDisease
count 918.000000 918.000000 918.000000 918.000000 918.000000 918.000000 918.000000
mean 53.510893 132.396514 198.799564 0.233115 136.809368 0.887364 0.553377
std 9.432617 18.514154 109.384145 0.423046 25.460334 1.066570 0.497414
min 28.000000 0.000000 0.000000 0.000000 60.000000 -2.600000 0.000000
25% 47.000000 120.000000 173.250000 0.000000 120.000000 0.000000 0.000000
50% 54.000000 130.000000 223.000000 0.000000 138.000000 0.600000 1.000000
75% 60.000000 140.000000 267.000000 0.000000 156.000000 1.500000 1.000000
max 77.000000 200.000000 603.000000 1.000000 202.000000 6.200000 1.000000
In [109]:
df_heart.shape
Out[109]:
(918, 12)
In [110]:
# box plots for the numeric columns of the dataset
plt.figure(figsize=(15, 10))
df_num = df_heart.select_dtypes(include=['float64', 'int64'])
for i, col in enumerate(df_num.columns, 1):
    plt.subplot(4, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.boxplot(df_num[col], color='lightgreen')
plt.tight_layout()
plt.show()
In [111]:
# performing outlier removal using z-scores.
df_without_outlier_zscore = df_heart.copy()
z_scores = np.abs(stats.zscore(df_without_outlier_zscore.select_dtypes(include=['int64', 'float64'])))
df_without_outlier_zscore = df_without_outlier_zscore[(z_scores < 3).all(axis=1)]
df_without_outlier_zscore.shape
Out[111]:
(899, 12)
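It can also help to see which columns contributed the 19 flagged rows. The sketch below applies the same |z| > 3 rule per column; it uses a small synthetic frame with a planted outlier so it runs standalone (the real data lives in csv/heart.csv, and the column names are stand-ins):

```python
import numpy as np
import pandas as pd
from scipy import stats

# synthetic stand-in for two numeric heart columns
rng = np.random.default_rng(0)
df = pd.DataFrame({'RestingBP': rng.normal(132, 18, 918),
                   'Cholesterol': rng.normal(199, 109, 918)})
df.loc[0, 'Cholesterol'] = 2000  # plant one extreme outlier

z = pd.DataFrame(np.abs(stats.zscore(df)), columns=df.columns, index=df.index)
flags = z > 3
print("rows flagged per column:")
print(flags.sum())

kept = df[(z < 3).all(axis=1)]
print("kept", len(kept), "of", len(df), "rows")
```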
In [112]:
# box plots for the numeric columns after outlier removal
plt.figure(figsize=(15, 10))
df_num = df_without_outlier_zscore.select_dtypes(include=['float64', 'int64'])
for i, col in enumerate(df_num.columns, 1):
    plt.subplot(4, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.boxplot(df_num[col], color='lightgreen')
plt.tight_layout()
plt.show()
In [113]:
# creating a copy of the dataframe
df = df_without_outlier_zscore.copy()
df.head()
Out[113]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
In [114]:
labelencoder = LabelEncoder()
df['ChestPainType'] = labelencoder.fit_transform(df['ChestPainType'])
df['RestingECG'] = labelencoder.fit_transform(df['RestingECG'])
df['ST_Slope'] = labelencoder.fit_transform(df['ST_Slope'])
df['ExerciseAngina'] = labelencoder.fit_transform(df['ExerciseAngina'])
df['Sex'] = labelencoder.fit_transform(df['Sex'])
df.head()
Out[114]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 1 1 140 289 0 1 172 0 0.0 2 0
1 49 0 2 160 180 0 1 156 0 1.0 1 1
2 37 1 1 130 283 0 2 98 0 0.0 2 0
3 48 0 0 138 214 0 1 108 1 1.5 1 1
4 54 1 2 150 195 0 1 122 0 0.0 2 0
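Label encoding imposes an artificial ordering on nominal categories such as ChestPainType; one-hot encoding, the alternative named in the task, avoids this. A minimal sketch with `pd.get_dummies` on a tiny illustrative frame:

```python
import pandas as pd

# tiny stand-in with the same categorical columns as the heart data
demo = pd.DataFrame({'Sex': ['M', 'F', 'M'],
                     'ChestPainType': ['ATA', 'NAP', 'ASY']})

# each ChestPainType category becomes its own indicator column
one_hot = pd.get_dummies(demo, columns=['ChestPainType'])
print(one_hot.columns.tolist())
```

For tree-based models the choice matters little, but for linear models and SVMs one-hot encoding usually behaves better on nominal features.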
In [115]:
# performing scaling and separating the data for a machine learning model.
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']
scaler = StandardScaler()
X = scaler.fit_transform(X)
In [116]:
# splitting the data into training and testing sets.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=23)
In [117]:
from sklearn.model_selection import cross_val_score
# dictionary containing the various models.
models = {
    "Logistic Regression": LogisticRegression(),
    "SVC": SVC(kernel = 'rbf'),
    "Random Forest Classifier": RandomForestClassifier(), 
    }

# iterating over each model in the `models` dictionary.
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)

    print(f"The accuracy of {model_name} model: {acc*100:.2f}%")
The accuracy of Logistic Regression model: 82.78%
The accuracy of SVC model: 85.00%
The accuracy of Random Forest Classifier model: 84.44%

The SVC model achieved the best accuracy, followed by the Random Forest Classifier and Logistic Regression.
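A single 80/20 split makes this ranking sensitive to the random seed; `cross_val_score` (imported earlier but unused) averages over several splits and gives a steadier comparison. A sketch on synthetic data, since the scaled heart features are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# illustrative stand-in for the scaled heart features (11 predictors)
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=23)

for name, model in {"Logistic Regression": LogisticRegression(max_iter=1000),
                    "SVC": SVC(kernel='rbf'),
                    "Random Forest Classifier": RandomForestClassifier(random_state=23)}.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)  # 5-fold accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```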

In [118]:
from sklearn.decomposition import PCA
# Applying PCA
n = X.shape[1]
pca = PCA(n_components=n) # applying on all features
X_pca = pca.fit_transform(X)
In [119]:
# splitting the PCA-transformed data into training and testing sets
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2, random_state=23)
In [120]:
# re-calculating the accuracy of all the models
models = {
    "Logistic Regression": LogisticRegression(),
    "SVC": SVC(kernel = 'rbf'),
    "Random Forest Classifier": RandomForestClassifier(), 
    }
for model_name, model in models.items():
    model.fit(X_train_pca, y_train_pca)
    y_pred_pca = model.predict(X_test_pca)
    acc_pca = accuracy_score(y_test_pca, y_pred_pca)

    print(f"The accuracy of {model_name} model: {acc_pca*100:.2f}%")
The accuracy of Logistic Regression model: 82.78%
The accuracy of SVC model: 85.00%
The accuracy of Random Forest Classifier model: 80.56%

PCA did not yield any further improvement: the Random Forest accuracy decreased from 84.44% to 80.56%, while the SVC and Logistic Regression accuracies remained unchanged. This is expected, since all components were retained and PCA with n_components equal to the number of features is only a rotation of the feature space. As before, the SVC model remains the most accurate of the three.
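The practical benefit of PCA comes from dropping low-variance components rather than keeping all of them. The `explained_variance_ratio_` attribute shows how many components are worth retaining; a sketch on standardized synthetic data (the correlated pair is planted for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(23)
X_demo = rng.normal(size=(300, 11))
X_demo[:, 1] = 0.9 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)  # correlated pair

X_std = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_std)

# smallest number of components whose cumulative variance share reaches 95%
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cumvar, 0.95)) + 1
print("components for 95% variance:", n_95, "of", X_std.shape[1])
```

Retraining on only the first `n_95` components, e.g. via `PCA(n_components=0.95)`, is the dimensionality-reduction step where an accuracy impact would actually be expected.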